Add support for "true" cursor based pagination in connections #730

Open
wants to merge 35 commits into
base: main

Conversation

diesieben07
Contributor

@diesieben07 diesieben07 commented Apr 9, 2025

This pull request adds a new connection class DjangoCursorConnection which supports efficient cursor-based pagination through any Django QuerySet without relying on offset-slicing.

Description

ListConnection uses slicing to achieve pagination. This works for the general case, but can be inefficient for large datasets, because large page numbers result in a large OFFSET in SQL. An alternative to limit/offset pagination is cursor-based pagination, which replaces OFFSET with range queries such as Q(due_date__gt=...). DjangoCursorConnection implements this approach.

How it works

DjangoCursorConnection inspects the QuerySet and extracts its ordering parameters. It then uses those parameters to construct the cursors. For example, for order_by("due_date", "pk") the ordering parameters would be due_date and pk. If an ordering parameter is an expression or not a direct field on the model (e.g. order_by(Upper("name")) or order_by("project__name")), a new annotation that mirrors the ordering expression is added to the queryset, so that the value can be extracted later.
The extracted values are then encoded into a cursor.
When paginating, the cursor is deconstructed into its parts again and those parts are then used to build a pagination filter.
For example, when ordering by "due_date", "pk", the cursor might contain the parts "2025-03-01" and "3". If that cursor is passed for after, the following filter would be constructed:
Q(due_date__gt="2025-03-01") | (Q(due_date="2025-03-01") & Q(pk__gt="3"))
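
To make the shape of that filter concrete, here is a minimal plain-Python sketch of the same lexicographic comparison. It is not the code from this PR: `build_after_predicate` and the dict-based rows are hypothetical, and the sketch assumes ascending ordering on every field and no NULL values.

```python
from datetime import date
from typing import Any, Callable, Dict, Sequence

Row = Dict[str, Any]


def build_after_predicate(
    fields: Sequence[str], cursor_values: Sequence[Any]
) -> Callable[[Row], bool]:
    """Return a predicate that is true for rows sorted strictly after the cursor.

    Hypothetical helper: ascending order only, no NULL handling.
    """

    def predicate(row: Row) -> bool:
        # Equivalent to: f1 > v1 OR (f1 == v1 AND (f2 > v2 OR (...)))
        for field, value in zip(fields, cursor_values):
            if row[field] > value:
                return True
            if row[field] != value:
                return False
        # Equal on every ordering field: this is the cursor row itself.
        return False

    return predicate


is_after = build_after_predicate(["due_date", "pk"], [date(2025, 3, 1), 3])
rows = [
    {"due_date": date(2025, 2, 28), "pk": 9},
    {"due_date": date(2025, 3, 1), "pk": 3},
    {"due_date": date(2025, 3, 1), "pk": 4},
    {"due_date": date(2025, 3, 2), "pk": 1},
]
page = [row for row in rows if is_after(row)]
```

With the cursor ("2025-03-01", 3), only the last two rows pass: the third ties on due_date and wins on pk, and the fourth wins on due_date alone.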

Serializing and deserializing the field values to strings is delegated to the model field implementation, ensuring maximal compatibility.
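
As an illustration of how already-serialized string parts could be packed into a single opaque cursor and unpacked again, here is a small sketch. The wire format is an assumption: this description does not say how the cursor is encoded, so JSON plus URL-safe base64 is used purely as an example, and `encode_cursor` / `decode_cursor` are hypothetical names.

```python
import base64
import json
from typing import List


def encode_cursor(parts: List[str]) -> str:
    """Pack string parts into one opaque cursor (assumed format: base64 of JSON)."""
    raw = json.dumps(parts, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_cursor(cursor: str) -> List[str]:
    """Unpack a cursor produced by encode_cursor, rejecting anything else."""
    raw = base64.urlsafe_b64decode(cursor.encode("ascii"))
    parts = json.loads(raw)
    if not isinstance(parts, list) or not all(isinstance(p, str) for p in parts):
        raise ValueError("malformed cursor")
    return parts


cursor = encode_cursor(["2025-03-01", "3"])
parts = decode_cursor(cursor)
```

Since cursors come back from clients, the decoder validates the shape of the payload before the parts are handed to any field deserialization.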

Other

  • During the implementation I discovered a bug in get_queryset_config. It is used in the code as if it did a "get or create" operation, setting the config on the QuerySet if not already present. However it did not actually do so.
    This is likely the reason this hack was implemented.

  • While writing tests I have noticed that strawberry_django.field(disable_optimization=True) had no effect when used on a top level field. I have fixed this.

  • Currently, the code lives in a separate relay_cursor.py file. I have chosen to do this for now to make the diff of this PR easier to parse. However I think the relay code should be refactored so that relay.py is removed and we have relay/__init__.py instead. Then the code can be split up into multiple files and still offer the same imports. What do you think?

I have fixed get_queryset_config, added specific tests for it and removed the (now unnecessary) hack in is_optimized_by_prefetching.
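
The difference between a getter that merely returns a default and one that performs a real "get or create" can be sketched as follows. This is one plausible shape of such a bug, not the actual strawberry_django code: `FakeQuerySet`, `Config`, `CONFIG_ATTR` and both functions are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass
class Config:
    optimized: bool = False
    optimized_by_prefetching: bool = False


class FakeQuerySet:
    """Stand-in for a QuerySet; it only needs to accept attributes."""


CONFIG_ATTR = "_sketch_config"  # hypothetical attribute name


def get_config_buggy(qs: FakeQuerySet) -> Config:
    # Returns a fresh default when nothing is stored, but never stores it,
    # so changes made to the returned object are silently lost.
    return getattr(qs, CONFIG_ATTR, None) or Config()


def get_config_fixed(qs: FakeQuerySet) -> Config:
    # Real "get or create": store the new config before returning it.
    config = getattr(qs, CONFIG_ATTR, None)
    if config is None:
        config = Config()
        setattr(qs, CONFIG_ATTR, config)
    return config


buggy_qs = FakeQuerySet()
get_config_buggy(buggy_qs).optimized_by_prefetching = True
lost = get_config_buggy(buggy_qs).optimized_by_prefetching

fixed_qs = FakeQuerySet()
get_config_fixed(fixed_qs).optimized_by_prefetching = True
kept = get_config_fixed(fixed_qs).optimized_by_prefetching
```

With the buggy variant the flag set on the first call is gone on the second call; with the fixed variant it persists on the object.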

Types of Changes

  • Core
  • Bugfix
  • New feature
  • Enhancement/optimization
  • Documentation

Checklist

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • I have tested the changes and verified that they work and don't break anything (as well as I can manage).

Contributor

sourcery-ai bot commented Apr 9, 2025

Reviewer's Guide by Sourcery

This pull request introduces DjangoCursorConnection for efficient cursor-based pagination and fixes a bug in get_queryset_config, removing a related hack. It enhances performance for large datasets and improves the reliability of queryset optimization.

Updated class diagram for StrawberryDjangoQuerySetConfig

classDiagram
  class StrawberryDjangoQuerySetConfig {
    -optimized: bool
    -optimized_by_prefetching: bool
    -type_get_queryset_did_run: bool
    -ordering_descriptors: list[OrderingDescriptor] | None
  }

File-Level Changes

Change: Introduces DjangoCursorConnection to enable cursor-based pagination, enhancing performance for large datasets by using range queries instead of offset-based slicing.
Details:
  • Adds a new connection class DjangoCursorConnection.
  • Implements cursor-based pagination using range queries.
  • Inspects QuerySet ordering parameters to construct cursors.
  • Serializes and deserializes field values using model field implementations.
  • Adds DjangoCursorConnection to strawberry_django.relay_cursor.
  • Adds tests for cursor pagination in tests/relay/test_cursor_pagination.py.
  • Adds a milestone_cursor_conn field to the Query type in tests/projects/schema.py.
Files:
  • strawberry_django/optimizer.py
  • docs/guide/relay.md
  • tests/projects/schema.py
  • tests/relay/test_cursor_pagination.py
  • strawberry_django/relay_cursor.py

Change: Fixes a bug in get_queryset_config to ensure it correctly sets the config on the QuerySet, and removes a related hack in is_optimized_by_prefetching.
Details:
  • Corrects get_queryset_config to properly set the config on the QuerySet.
  • Removes an unnecessary hack in is_optimized_by_prefetching.
  • Adds specific tests for get_queryset_config.
  • Adds optimized_by_prefetching to StrawberryDjangoQuerySetConfig.
  • Modifies is_optimized_by_prefetching to use get_queryset_config.
  • Updates mark_optimized_by_prefetching to set optimized_by_prefetching in get_queryset_config.
Files:
  • strawberry_django/optimizer.py
  • strawberry_django/queryset.py
  • tests/test_queryset_config.py

Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey @diesieben07 - I've reviewed your changes - here's some feedback:

Overall Comments:

  • Consider adding a section to the documentation that compares and contrasts ListConnection and DjangoCursorConnection, highlighting when each should be used.
  • It might be worth adding a note about the implications of strictly ordered QuerySets in the documentation.
Here's what I looked at during the review
  • 🟡 General issues: 1 issue found
  • 🟢 Security: all looks good
  • 🟡 Testing: 1 issue found
  • 🟡 Complexity: 1 issue found
  • 🟢 Documentation: all looks good


## Cursor based connections

As an alternative to the default `ListConnection`, `DjangoCursorConnection` is also available.
It supports pagination through a Django `QuerySet` via "true" cursors.
Contributor

suggestion: Clarify the meaning of "true cursors". Suggest using "offset-based cursors" vs "range-based cursors" to distinguish the approaches.

The term "true cursors" might be confusing to users. Using more descriptive terms like "offset-based cursors" and "range-based cursors" would improve clarity.

Suggested implementation:

It supports pagination through a Django `QuerySet` via range-based cursors.

`ListConnection` uses offset-based cursors (slicing) to achieve pagination, which can negatively affect performance for huge datasets,

return [getattr(obj, descriptor.attname) for descriptor in descriptors]


def build_tuple_compare(
Contributor

issue (complexity): Consider refactoring the complex, nested logic in build_tuple_compare and apply_cursor_pagination into smaller, well-named helper functions to improve readability and maintainability.

Consider splitting the complex, nested logic into smaller helper functions. For example, the loop in build_tuple_compare mixes comparator and equality handling with nested conditionals. You can extract the "compare" logic into a helper:

def _compare_field(descriptor: OrderingDescriptor, field_value: Any, before: bool) -> Q:
    value_expr = Value(field_value, output_field=descriptor.order_by.expression.output_field)
    comparator = descriptor.get_comparator(value_expr, before)
    eq = descriptor.get_eq(value_expr)
    if comparator is None:
        return eq
    return comparator | (eq & Q())

def build_tuple_compare(
    descriptors: list[OrderingDescriptor],
    cursor_values: list,
    before: bool,
) -> Q:
    comparators = [
        _compare_field(descriptor, value, before)
        for descriptor, value in zip(reversed(descriptors), reversed(cursor_values))
    ]
    # Combine comparators with an 'AND' chain
    current = Q()
    for comp in comparators:
        current &= comp
    return current

Similarly, consider isolating parts of the slicing logic in apply_cursor_pagination into a small helper. For example:

def _apply_slice(qs: QuerySet, slice_: slice, related_field_id: Optional[str]) -> QuerySet:
    if related_field_id:
        offset = slice_.start or 0
        return apply_window_pagination(qs, related_field_id=related_field_id, offset=offset, limit=slice_.stop - offset)
    return qs[slice_]

# Then in apply_cursor_pagination replace:
if slice_ is not None:
    qs = _apply_slice(qs, slice_, related_field_id)

This approach keeps all functionality intact while reducing nested code blocks and improving readability.

@diesieben07
Contributor Author

I don't know why the Typing test fails. pyright reports no problems for me locally and the reported file here is a file that I haven't touched.

@codecov-commenter

codecov-commenter commented Apr 9, 2025

Codecov Report

Attention: Patch coverage is 99.60159% with 1 line in your changes missing coverage. Please review.

Project coverage is 88.99%. Comparing base (8898175) to head (866589f).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
strawberry_django/optimizer.py 94.44% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #730      +/-   ##
==========================================
+ Coverage   88.29%   88.99%   +0.70%     
==========================================
  Files          42       43       +1     
  Lines        3920     4161     +241     
==========================================
+ Hits         3461     3703     +242     
+ Misses        459      458       -1     

@diesieben07
Contributor Author

diesieben07 commented Apr 9, 2025

It's currently broken for nulls in ordering. I'm working on fixing that. Update: fixed.

Member

@bellini666 bellini666 left a comment

Good job here, really appreciate your contributions ❤️

Left a couple of comments/suggestions

Comment on lines +1478 to +1479
get_queryset_config(qs).optimized_by_prefetching = True
return qs
Member

question: is this change safe? If I remember correctly, the reason we would do this was to mark a prefetch query as optimized, because that config there could be lost

Is that not an issue anymore with the changes here? I sincerely don't remember how it would be lost and even if we have a test that would break without this (hopefully we do)

Contributor Author

From my understanding it is safe, yes. If I change is_optimized_by_prefetching to just return False then various tests fail, most importantly this one:

def test_nested_pagination(gql_client: utils.GraphQLTestClient):
but also various others that work with nested connections.

My understanding of why this code was written as it was before my change is as follows:

  1. get_queryset_config was broken in that it never set the configuration on the QuerySet. This wasn't caught because
    1. There were no tests for it.
    2. If it fails, that is only catastrophic for is_optimized_by_prefetching. The other things in the QuerySet config can cause the optimizer to optimize a QuerySet twice or they can cause a type's get_queryset to be called twice. Both of these are usually idempotent so don't cause any issues except for wasted CPU cycles.
  2. As a result, the get_queryset_config bug never surfaced until is_optimized_by_prefetching was introduced. At that point the disappearing config was wrongly attributed to prefetch_related, when in reality get_queryset_config had never worked.

Comment on lines +160 to +161
class AttrHelper:
    pass
Member

question: Why do we have this?

Contributor Author

We need to serialize the values of all order_by columns into the cursor, which is just a string. The fields could be of any type so we don't want to just str() them. Django model fields already have built-in serialization to and from strings, which Django uses for serializing and deserializing model instances (mainly for fixtures).

The API, however, is unfortunately somewhat clunky. Instead of Field.value_to_string(value) we get Field.value_to_string(obj), with obj being the model instance to look up the field on. So you could do something like

field_value_str = some_model_field.value_to_string(my_model_instance)

This works for fields which are present on the model, but users can order by arbitrary annotations:

qs.alias(name_up=Upper('name')).order_by('name_up')

Now we need to somehow turn the value of that annotation into a string via value_to_string - but there's no actual field for it. Hence we construct this AttrHelper and set the attribute that the field is looking for on it. This happens here:
https://github.com/diesieben07/strawberry-graphql-django/blob/39dcdd6e09aaf20c831707d6e01e4bfdacb24028/strawberry_django/relay_cursor.py#L179-L184

We can't use a plain object(), because you can't set arbitrary attributes on it.

Django's own ArrayField for PostgreSQL uses a similar technique:
https://github.com/django/django/blob/ab1b9cc1b38e8735979c817f42e8f54a5795c4f7/django/contrib/postgres/fields/array.py#L166-L167
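
The technique can be shown in isolation with a short sketch. `FakeDateField` below is not a Django class: it only mimics the attribute-lookup shape of Field.value_to_string(obj) described above, and the `value_to_string` wrapper is a hypothetical name for illustration.

```python
from datetime import date
from typing import Any


class FakeDateField:
    """Stand-in for a model field whose value_to_string reads from an object."""

    attname = "due_date"

    def value_to_string(self, obj: Any) -> str:
        # Like the Django API, this takes an object and looks the value up
        # by attribute name instead of accepting the value directly.
        value = getattr(obj, self.attname)
        return "" if value is None else value.isoformat()


class AttrHelper:
    """Empty class whose instances accept arbitrary attributes."""


def value_to_string(field: FakeDateField, value: Any) -> str:
    # Put the bare value where the field expects to find it, then let the
    # field do the serialization.
    helper = AttrHelper()
    setattr(helper, field.attname, value)
    return field.value_to_string(helper)


serialized = value_to_string(FakeDateField(), date(2025, 3, 1))

# A plain object() has no __dict__, so the same trick fails on it.
try:
    setattr(object(), "due_date", date(2025, 3, 1))
    plain_object_works = True
except AttributeError:
    plain_object_works = False
```

The empty class exists only to provide an instance with a `__dict__`; any user-defined class would do.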

3 participants